The bag of communities: Identifying abusive behavior online with preexisting internet data

Chandrasekharan, E.; Samory, M.; Srinivasan, A.; Gilbert, E.

doi:10.1145/3025453.3026018

Since its earliest days, harassment and abuse have plagued the Internet. Recent research has focused on in-domain methods to detect abusive content and faces several challenges, most notably the need to obtain large training corpora. In this paper, we introduce a novel computational approach to address this problem called Bag of Communities (BoC) - a technique that leverages large-scale, preexisting data from other Internet communities. We then apply BoC toward identifying abusive behavior within a major Internet community. Specifically, we compute a post's similarity to 9 other communities from 4chan, Reddit, Voat and MetaFilter. We show that a BoC model can be used on communities "off the shelf" with roughly 75% accuracy - no training examples are needed from the target community. A dynamic BoC model achieves 91.18% accuracy after seeing 100,000 human-moderated posts, and uniformly outperforms in-domain methods. Using this conceptual and empirical work, we argue that the BoC approach may allow communities to deal with a range of common problems, like abusive behavior, faster and with fewer engineering resources.
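As a rough illustration of the idea of scoring a post by its similarity to several existing communities, the following is a minimal sketch only. The abstract does not say which similarity measure or classifier the authors use; this sketch assumes term-frequency vectors and cosine similarity. The community names, the toy posts, and every function name are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_vector(tokens):
    """Relative term frequencies as a sparse dict (empty if no tokens)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def community_profiles(corpora):
    """Build one term-frequency profile per community from its posts."""
    return {
        name: tf_vector([tok for post in posts for tok in tokenize(post)])
        for name, posts in corpora.items()
    }


def boc_features(post, profiles):
    """Similarity of one post to each community (assumed measure: cosine)."""
    vector = tf_vector(tokenize(post))
    return {name: cosine(vector, profile) for name, profile in profiles.items()}


# Hypothetical toy corpora standing in for preexisting community data.
corpora = {
    "friendly_forum": [
        "thanks for the helpful answer",
        "great post, thanks for sharing",
    ],
    "hostile_forum": [
        "you are wrong and nobody likes you",
        "go away, nobody asked you",
    ],
}

profiles = community_profiles(corpora)
features = boc_features("thanks, this answer was really helpful", profiles)
for name, score in features.items():
    print(f"{name}: {score:.3f}")
```

In a setup like the one the abstract describes, a vector of per-community scores of this kind could serve as input to a downstream classifier, which is how a model might be applied to a target community without in-domain training examples.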

The bag of communities: Identifying abusive behavior online with preexisting internet data / Chandrasekharan, E.; Samory, M.; Srinivasan, A.; Gilbert, E. - (2017), pp. 3175-3187. (Paper presented at the 2017 ACM SIGCHI Conference on Human Factors in Computing Systems, CHI 2017, held in Denver, CO, USA) [10.1145/3025453.3026018].



Publication year: 2017

Conference name: 2017 ACM SIGCHI Conference on Human Factors in Computing Systems, CHI 2017

Keywords: Abusive behavior; Machine learning; Moderation; Online communities; Social computing

Type: 04 Publication in conference proceedings::04b Conference paper in volume


Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1655754
